ABSTRACT

We analyze software reuse from the perspective of informa- 
tion theory and Kolmogorov complexity, assessing our ability 
to "compress" programs by expressing them in terms of soft- 
ware components reused from libraries. A common theme 
in the software reuse literature is that if we can only get the 
right environment in place — the right tools, the right gener- 
alizations, economic incentives, a "culture of reuse" — then 
reuse of software will soar, with consequent improvements 
in productivity and software quality. The analysis developed 
in this paper paints a different picture: the extent to which 
software reuse can occur is an intrinsic property of a problem 
domain, and better tools and culture can have only marginal 
impact on reuse rates if the domain is inherently resistant to 
reuse. We define an entropy parameter H ∈ [0, 1] of problem
domains that measures program diversity, and deduce
from this upper bounds on code reuse and the scale of compo- 
nents with which we may work. For "low entropy" domains 
with H near 0, programs are highly similar to one another 
and the domain is amenable to the Component-Based Soft- 
ware Engineering (CBSE) dream of programming by compos- 
ing large-scale components. For problem domains with H 
near 1, programs require substantial quantities of new code, 
with only a modest proportion of an application comprised 
of reused, small-scale components. Preliminary empirical re- 
sults from Unix platforms support some of the predictions of 
our model. 
1. INTRODUCTION AND OVERVIEW

Software reuse offers the hope that software construction 
can be made easier by systematic reuse of well-engineered 
components. In practice reuse has been found to improve 
productivity and reduce defects (3| [T^l [T5] [HI |33) . But what 
of the limits of reuse — will large-scale reuse make software 
construction easier? Thinking here is varied, but for the sake 
of argument let me artificially divide the opinions into two 
competing hypotheses. First the more enthusiastic end of the 
spectrum, which I associate with the Component-Based Soft- 
ware Engineering (CBSE) movement. 

Hypothesis 1 (Strong reuse). Large-scale reuse will allow 
mass-production of software, with applications being assembled 
by composing large, pre-existing components. The activity of 
programming will consist primarily of choosing appropriate com- 
ponents from libraries, adapting and connecting them. 

Strong reuse is thought to thrive in problem domains with 
great concentration of effort and similarity of purpose, i.e., 
many people writing similar software whose requirements show 
only minor variation. However, the question of whether 
strong reuse can succeed for software construction considered 
globally, across disciplines and organizations, remains uncer- 
tain. A more cautious view of reuse is the following. 

Hypothesis 2 (Weak reuse). Large-scale reuse will offer use- 
ful reductions in the effort of implementing software, but these 
savings will be a fraction of the code required for large projects. 
Nontrivial projects will always require the creation of substan- 
tial quantities of new code that cannot be found in existing com- 
ponent libraries. 

Representative of weak reuse thinking is the following prescription
for code reuse in well-engineered software from Jeffrey
Poulin [23]: up to 85% of code ought to be reused from
libraries, with a remaining 15% custom code, written specifically
for the application and having little reuse potential.
The percentage of code that may be reused from libraries 
varies greatly across problem domains, but weak reuse paints 
a fairly accurate picture of the software landscape of today. 
Many explanations are proposed for why strong reuse is not 
happening on a global scale (cf. [5]). A common position in
the reuse literature is that if we can only get the right environ- 
ment in place — the right tools, generalizations, economics, 
a "culture of reuse" — then reuse of software will soar, with 
consequent improvements in productivity and software qual- 
ity. 



A contrary view. The perspective developed in this paper 
suggests that the extent to which reuse can happen is an in- 
trinsic property of a problem domain, and that improving the 
ability of programmers to find, adapt, deploy, generalize and 
market components will have only marginal impact on reuse 
rates if the domain is resistant to reuse. We propose to associate
with problem domains an entropy parameter 0 ≤ H ≤ 1
measuring the diversity of a problem domain. When H = 1,
software is extremely diverse and we should expect very little
potential for reuse; in fact, we show that the proportion of an
application we can draw from libraries approaches zero for
large projects. For problem domains with H < 1, software
is somewhat homogeneous, and with decreasing H comes increasing
potential for reuse. The theory we develop suggests
that an expected proportion of at most (1 − H) of an application's
code may be reused from libraries, with a remaining
proportion H being custom code written specifically for the
application. As H nears 0 we enter the strong reuse utopia of
"programming by composing large components." The possibilities
of reuse are strictly limited by the parameter H, which
is an intrinsic property of the problem domain.

We develop this theory by examining our ability to com- 
press or compactify software by the use of libraries. We shall 
speak throughout this paper of compressed programs, by which 
we mean programs written using libraries, and uncompressed 
programs that are stand-alone and do not refer to library components.
The principal tools we employ are information theory
and Kolmogorov complexity. Both of these carry subtly
different notions of compressibility that we shall have to jug- 
gle. The information theory notion deals with compressing 
objects by identifying patterns that appear frequently and giv- 
ing them short descriptions — as in English we have taken to 
saying "car" for "automobile carriage." The Kolmogorov ver- 
sion of compressibility describes our ability to find for a given 
program a shorter program with the same behaviour, without 
appealing to how typical that program might be for the prob- 
lem domain within which we are working. We assume some 
basic familiarity with information theory as might be found in,
e.g., [7, Ch. 2] or [19]. The essentials of Kolmogorov complexity
are reviewed in Section 3.

Library components and prime numbers. Integers factor 
into a product of primes; software can be factored into an 
assembly of components. Library components are the prime 
numbers of software. This would be a terribly naive thing to 
say were it not for the many wonderful parallels that turn up: 

• There are infinitely many primes; in Section 5.2.1 we
prove there are infinitely many components for a problem
domain that reduce expected program size (thus
guaranteeing employment for library writers).

• The n-th prime is a factor of ≈ $\frac{1}{n \ln n}$ of the integers. Theory
predicts the n-th most frequently used library component
has an ideal reuse rate of about $\frac{1}{n \log n \log^{+} n}$
(Section 4.2).

• The Erdos-Kac theorem states that the number of factors 
of an integer tends to a normal distribution; we mea- 
sure experimental data that suggests a similar theorem 
might be provable for software components (Figure^}. 

• The Prime Number Theorem states that the n-th prime is
~ log(n ln n) bits long. We show that the ideal configuration
for libraries is that the n-th most frequently used
component is of size > log n and < i^p-o(n 6 ) for ε > 0
(Section 5.2.2).

Figure 1: Data collected from shared objects on several
Unix platforms, showing the number of references to library
subroutines, with subroutines ordered by use frequency (n).
The observed number of references
shows good agreement with Zipf-style frequency laws of
the form c · n^{-1} (dotted diagonal lines). A detailed explanation
of this data is given in Section 5.
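The prime-number side of the analogy can be checked numerically. The following is a minimal sketch, not taken from the paper: the helper `first_primes` is a hypothetical name, and the sieve limit relies on the standard bound that the n-th prime is below n(ln n + ln ln n) for n ≥ 6. It compares the exact fraction 1/p_n of integers divisible by the n-th prime with the estimate 1/(n ln n), and the bit length of p_n with log2(n ln n).

```python
import math

def first_primes(count):
    """Return the first `count` primes using a sieve of Eratosthenes."""
    # Assumed upper bound on the count-th prime: n (ln n + ln ln n) for n >= 6.
    if count < 6:
        limit = 15
    else:
        limit = int(count * (math.log(count) + math.log(math.log(count)))) + 1
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if sieve[i]:
            # Clear every multiple of i starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    primes = [i for i, flag in enumerate(sieve) if flag]
    return primes[:count]

primes = first_primes(1000)
for n in (10, 100, 1000):
    p = primes[n - 1]
    exact = 1 / p                     # fraction of integers divisible by the n-th prime
    estimate = 1 / (n * math.log(n))  # Prime Number Theorem estimate
    print(n, p, exact, estimate, p.bit_length(), math.log2(n * math.log(n)))
```

The estimate is only asymptotic, so the two columns agree in order of magnitude rather than digit for digit at these small values of n.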

Reuse and Zipf's Law. It is known that hardware instruction
frequencies follow an iconic curve described by George K. Zipf
for word use in natural languages [16, 18, 27]. Zipf noted
that if words in a natural language are ranked according to
use frequency, the frequency of the n-th word is about n^{-1}.
Zipf-style empirical laws crop up in many fields [24, 21]. Evidence
suggests programming language constructs also follow
a Zipf-like law [5, 17]. It is natural then to wonder if this result
might extend to library components. Our results support
this conclusion. Figure 1 shows the reuse counts of subroutines
in shared objects on three Unix platforms, clearly showing
Zipf-like n^{-1} curves. These results are described in detail
in Section 5. The appearance of such curves is not happenstance.
In Section 4.2 we argue they are a direct result of programmers
trying to write as little code as possible by reusing
library subroutines; this drives reuse rates toward a "maximum
entropy" configuration, namely a Zipf's law curve.
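A Zipf-style law of the form c · n^{-1} appears as a straight line of slope −1 on a log-log plot of frequency against rank. The sketch below is illustrative only: it builds an idealized Zipf distribution over a hypothetical number of items and recovers the slope by least squares. The function names are assumptions of this example, and no measured data from the paper is used.

```python
import math

def zipf_frequencies(num_items):
    """Normalized Zipf frequencies f(n) = c / n for ranks n = 1..num_items."""
    harmonic = sum(1 / n for n in range(1, num_items + 1))
    c = 1 / harmonic  # normalizing constant so the frequencies sum to 1
    return [c / n for n in range(1, num_items + 1)]

def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank), rank starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

freqs = zipf_frequencies(1000)
print(loglog_slope(freqs))
```

Applied to observed reference counts rather than idealized ones, the same slope estimate gives a quick check of how closely a data set follows an n^{-1} curve.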

1.1 Organization 

The remainder of this paper is organized as follows. Section 2
introduces an abstract model of software reuse from
which we derive our results. In Section 3 we give a brief
overview of Kolmogorov complexity. In Section 4 we derive
bounds on the rates at which software components may be
reused, and give an account for the appearance of Zipf-style
empirical laws. Section 5 examines the potential for software
reuse as a function of the parameter H. In Section 6 we
present some preliminary experimental results, and Section 7
concludes.

2. MODELLING LIBRARY REUSE 



In this section we propose an abstract model capturing some 
essential aspects of software reuse within a problem domain. 
The basic scenario is this: we have a library, possibly many 
libraries that we collectively consider as one, that contains 
a great number of software components. These components 
may be subroutines, architectural skeletons, design patterns, 
generics, component generators, or whatever form of abstrac- 
tion we may yet invent; their precise nature is unimportant 
for the argument. In using a component from the library 
we achieve some reduction in the size of the program, and 
perhaps consequently, in the effort required to implement it. 
Program size serves as a rough lower bound to effort, but it 
would be a grave error to confuse the two. 

2.1 Distribution of programs in a domain 

We presume that the projects undertaken by programmers 
working in a problem domain can be modelled by a proba- 
bility distribution on programs. The probability distribution 
is defined on "uncompressed" programs that do not use any 
library components. These uncompressed programs can be 
viewed as specifications that programmers set out to realize. 

We consider compiled programs modelled by binary strings 
on the alphabet {0, 1}. We write ||w|| for the length of a string
w. Finite programs are countably infinite in number, so we 
immediately encounter the problem of defining a probabil- 
ity distribution in which the probability of encountering indi- 
vidual programs may be infinitesimal. A rigorous approach 
would be to employ measure theory, for example Loeb measure,
which would allow us to speak of the probability of individual
programs. This would require some rather daunting
machinery and we instead settle for a more accessible approach
similar to that used by [4, 6, 25].

Let $A_{\le n} = \{w \in \{0,1\}^* : \|w\| \le n\}$ denote compiled
programs of length at most n bits. We introduce a family of
conditional distributions $\{p_{s_0}\}_{s_0 \in \mathbb{N}}$ whose domains consist of
programs ≤ $s_0$ bits in size, that is,

$$p_{s_0} : A_{\le s_0} \to \mathbb{R}$$

and satisfying $\sum_{w \in A_{\le s_0}} p_{s_0}(w) = 1$ and $p_{s_0}(w) \ge 0$. The intent is
that $p_{s_0}(w)$ gives the probability that someone working in
the problem domain will set out to realize the particular (uncompressed)
program w, given that w is at most $s_0$ bits long.
For this family of distributions to be compatible with one another
we require that $p_{s_0}(w) = p_{s_0+1}(w \mid \|w\| \le s_0)$, i.e., we
can get the distribution on length ≤ $s_0$ programs by taking a
conditional probability on the distribution for length $s_0 + 1$
programs. We do not presume that such distributions can be
effectively described.
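The compatibility requirement can be made concrete with a toy family of distributions. The sketch below is an illustration under stated assumptions, not a model from the paper. It assumes a hypothetical unnormalized preference `weight(w)` over binary strings and obtains each distribution by normalizing over strings of length at most s_0. Any family built this way satisfies the conditioning property, which the code checks exactly using rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def programs_up_to(max_bits):
    """All binary strings of length 0..max_bits, i.e. the set A_{<=n}."""
    return ["".join(bits)
            for length in range(max_bits + 1)
            for bits in product("01", repeat=length)]

def weight(w):
    """Hypothetical unnormalized preference: favours short strings with many 1-bits."""
    return Fraction(1 + w.count("1"), 4 ** len(w))

def distribution(s0):
    """p_{s0}: normalize the weights over programs of at most s0 bits."""
    support = programs_up_to(s0)
    total = sum(weight(w) for w in support)
    return {w: weight(w) / total for w in support}

def conditioned(p_next, s0):
    """p_{s0+1}(w | ||w|| <= s0): restrict to short programs and renormalize."""
    mass = sum(prob for w, prob in p_next.items() if len(w) <= s0)
    return {w: prob / mass for w, prob in p_next.items() if len(w) <= s0}

p3 = distribution(3)
p4 = distribution(4)
print(conditioned(p4, 3) == p3)
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.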

In what follows we use the usual notation for expectation
with the implied assumption of $s_0 \to \infty$; for example, if
$f : \{0,1\}^* \to \mathbb{R}$ maps programs to real numbers, then by $E[f(w)]$
we mean:

$$E[f(w)] = \lim_{s_0 \to \infty} \sum_{w : \|w\| \le s_0} f(w)\, p_{s_0}(w)$$

if such a limit should exist. For example, a mean program
size $E[\|w\|]$ may exist for a problem domain, but we do not
require nor expect this.

2.2 The entropy parameter H

A key, perhaps defining, feature of a problem domain is that 
there is similarity of purpose in the programs people write. 
We do not expect the distribution of programs written in a
problem domain to be uniform over all possible programs, but
rather concentrated on programs that solve certain classes of
problems typical for the domain. We formalize this intuition
by introducing a parameter H for problem domains measuring
how far their probability distribution departs from uniform.
This H is very similar to entropy rate from information
theory, and coincides with it if we are willing to assume programs
are drawn from a stationary stochastic process. When H = 1
the distribution over programs is uniform, modelling extreme
diversity of software, with little opportunity for reuse. For
H < 1 there is some potential for reuse. In fact as we shall
see shortly, we may expect that up to a proportion 1 − H of
programs may be reused from libraries.

Define the entropy of each distribution $p_{s_0}$ in the standard
way (see, e.g., 0E3): 

$$H(p_{s_0}) = -\sum_{w : \|w\| \le s_0} p_{s_0}(w) \log_2 p_{s_0}(w)$$

This is the expected number of bits required to represent a
program of size ≤ $s_0$ in this domain. We are interested in the
limit behaviour of $\frac{1}{s_0} H(p_{s_0})$, akin to the entropy rate of a
random process. In general this limit may not exist (there
might be oscillations), so we need some weaker notion of
limit. We settle for a limsup, which gives an almost sure upper
bound on the limit behaviour.

Definition 1 (Entropy parameter). Define the entropy parameter
H of a problem domain to be the greatest value that
$\frac{1}{s_0} H(p_{s_0})$ attains infinitely often as $s_0 \to \infty$:

$$H = \limsup_{s_0 \to \infty} \frac{1}{s_0} H(p_{s_0})$$

As a consequence of this definition we are guaranteed that
$H(p_{s_0}) \le s_0 H$ almost surely as $s_0 \to \infty$.
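Two toy domains show the extremes of the entropy parameter. The sketch below is illustrative only and rests on an assumption: the entropy of each distribution is normalized by s_0, which is consistent with the stated consequence that H(p_{s_0}) is at most s_0 H. The two domains, `uniform_domain` and `repetitive_domain`, are hypothetical constructions for this example.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a finite probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_domain(s0):
    """Toy domain: every binary string of length <= s0 is equally likely."""
    count = 2 ** (s0 + 1) - 1  # number of strings of length 0..s0
    return [1.0 / count] * count

def repetitive_domain(s0):
    """Toy domain: only the all-zero strings of length 0..s0 occur, equally likely."""
    count = s0 + 1
    return [1.0 / count] * count

def normalized_entropy(distribution, s0):
    """Entropy divided by s0 (the normalization assumed in this sketch)."""
    return entropy_bits(distribution) / s0

for s0 in (4, 8, 16):
    print(s0,
          round(normalized_entropy(uniform_domain(s0), s0), 3),
          round(normalized_entropy(repetitive_domain(s0), s0), 3))
```

Under this normalization the uniform domain tends to 1 from slightly above, since there are just under 2^{s_0+1} strings of length at most s_0, while the repetitive domain tends to 0.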

We cannot hope to calculate H from first principles except
for toy scenarios, but there is hope we might estimate it empirically.
We introduce H primarily as a theoretical tool to
model problem domains in which people have great similarity
of purpose (H → 0) or diffuse interests (H → 1). The
main impact of H is the following.

Claim 2.1. In a problem domain with entropy parameter H,
the expected proportion of code that may be reused from a library
is at most 1 − H.

This is a consequence of the Noiseless Coding Theorem of 
information theory (e.g., Q §2.5]), which states that coding 
random data with entropy H requires (on average) at least 
H bits. Suppose an uncompressed program has size $s \le s_0$.
We defined H so that $H(p_{s_0}) \le sH$ almost surely, so we can
compress programs to an expected size of at best sH by the
Noiseless Coding Theorem. Therefore the expected amount
of code saved by use of the library is at most (1 − H)s, and it
is reasonable to equate this with the amount of code reused
from the library. An immediate implication is that blanket 
reuse prescriptions such as "effective organizations reuse 70% 
of their code from libraries" are unrealistic; reuse goals need 
to be pegged to the problem domain's value of H. 
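The arithmetic behind Claim 2.1 is simple enough to state as code. The following sketch uses hypothetical values of H and of the program size; `reuse_bound` is a name invented for this example. It returns the smallest expected compressed size, Hs, and the largest expected amount of reused code, (1 − H)s.

```python
def reuse_bound(uncompressed_bits, entropy_parameter):
    """Return (minimum expected compressed size, maximum expected reused bits)."""
    if not 0.0 <= entropy_parameter <= 1.0:
        raise ValueError("entropy parameter H must lie in [0, 1]")
    min_compressed = entropy_parameter * uncompressed_bits
    max_reused = (1.0 - entropy_parameter) * uncompressed_bits
    return min_compressed, max_reused

def reuse_target_is_feasible(target_fraction, entropy_parameter):
    """A reuse target is within the bound only if it does not exceed 1 - H."""
    return target_fraction <= 1.0 - entropy_parameter

# Hypothetical domains: a homogeneous one (H = 0.25) and a diverse one (H = 0.5).
print(reuse_bound(1000, 0.25))
print(reuse_target_is_feasible(0.70, 0.25), reuse_target_is_feasible(0.70, 0.5))
```

In this reading, a blanket 70% reuse target is within the bound for the first hypothetical domain and out of reach for the second, whatever tools are used.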

Figure 2 illustrates the scenario we consider in this paper:
programmers set out to implement the capabilities of some
uncompressed program of length s written without use of a
library, drawn from the distribution for the problem domain.
A programmer implements the program making use of the library,
effectively "compressing" it. The expected size of the
compressed program is at least Hs bits, by the previous arguments.
The library consists of a set of components, each with
an identifier or codeword by which it is referred to. We
always take programs to be compiled, so as not to care about
the high compressibility of source representations.

Figure 2: The basic scenario: programmers in a problem
domain set out to realize a program that can be represented
in s bits when compiled without the use of a library.
By using library components, they are able to reduce
the size of the compiled program, down to an expected
size of ≥ Hs bits.

2.2.1 Motifs and the AEP

One question we should like to answer is whether when 
H < 1 there are commonly occurring patterns or "motifs" 
in programs that we can put in libraries and reuse to com- 
press programs. If we are willing to assume that programs 
in a problem domain behave as if excerpted from a stationary
ergodic source, then the Shannon-McMillan-Breiman theorem
(asymptotic equipartition property or AEP) [7, §15.7]
ensures that when H < 1 there are commonly occurring finite 
subsequences in programs that can be exploited, and indeed 
that we can achieve optimal compression of programs merely 
by having libraries of common instruction sequences. That 
more complex software components prove necessary in prac- 
tice suggests the stationary ergodic assumption is too strong, 
and a weaker ergodic property is needed to account for the 
emergence of motifs in software when H < 1. It is unclear 
yet exactly what this property might be; in the remainder of 
this paper we do not assume the AEP.

2.3 Libraries maximize entropy 

A truly great computer programmer is lazy, im- 
patient and full of hubris. Laziness drives one to 
work very hard to avoid future work for a future 
self. — Larry Wall 

Programmers, so we read, are lazy — they write libraries 
to capture commonly occurring abstractions so they do not 
have to write them over and over again. The social processes 
that drive programmers to develop libraries have an interest- 
ing theoretical effect. We can view programmers contributing 
to domain-specific libraries as collectively defining a system 



for compressing programs in that domain. If there is a com- 
mon pattern, eventually someone will identify it and put it in 
a library. Since the absence of common patterns in code is 
implied by high entropy, we propose the following principle. 

Principle 1 (Entropy maximization). Programmers develop 
domain-specific libraries that minimize the amount of fre- 
quently rewritten code for the problem domain. This tends 
to maximize the entropy of compiled programs that use li- 
braries. 

As evidence for this principle, we show in Section 6 that
the rate at which library components are reused is empirically 
observed to approach a maximum entropy configuration. 1 

In practice programmers have to strike a balance between 
the succinctness of their programs and their readability; see, 
e.g., [11] for an elegant discussion of such tradeoffs. However,
we maintain that the drive toward terseness and factoring
common patterns is a defining pressure on library development:
entropy is essentially a measure of communication
efficiency, and programmers edge as close to maximum en- 
tropy as they can while maintaining source-code understand- 
ability. 2 

2.4 The Platonic library 

In the early days of computing libraries held a hundred sub- 
routines at most; these days it is common for computers to 
have a hundred thousand subroutines available for reuse (cf. 
Section^. Let us suppose that as time goes on we shall con- 
tinue to add components to our libraries as we discover use- 
ful abstractions and algorithms. Our current libraries might 
be viewed as a truncated version of some infinite (but count- 
able) library toward which we are slowly converging. It is 
convenient to pretend that this limit already exists as some in- 
finite "Platonic library" for the problem domain, and that we 
are merely discovering ever-larger fragments of it, recalling 
Erdos' book of divine mathematical proofs. 3 Were we granted 
access to the entire library, we might write software in a very 
efficient way. We use the Platonic library as a device — a con- 
venient fiction — to reason about how useful finite libraries 
might be. 

Infinite objects need to be treated with care. We shall not 
assume that some "optimal infinite library" exists that is the 
best possible such library. Nor shall we assume there is some 
finite description or computable enumeration of its contents. 
We merely assume that fragments of the Platonic library give 
us snapshots of what shall be in our software libraries over 
time. 

2.5 Existence of reuse rates 

Numerous metrics have been proposed for measuring reuse.
We focus on the reuse rate of a component, which we write
λ(n) and define as the expected rate at which references are
made to the n-th library component in a compressed program.
The units of λ(n) are expected references per bit of compiled
code. We assume mean reuse rates exist in a problem domain,
in the following sense.

Assumption 1. Let $\mathrm{Refs}_n(w)$ count the number of references
to the n-th component in a compressed program w of size ≤ $s_0$.
We assume that

$$E[\mathrm{Refs}_n(w) \mid \|w\| = s] \sim \lambda(n)\, s + o(s) \quad \text{as } s_0 \to \infty \qquad (1)$$

where o(s) denotes some error term growing asymptotically slower
than s.

1 Note that Principle 1 is not intended to appeal to the maximum
entropy principle as advocated by Jaynes, which deals
with maintaining uncertainty in inference.

2 We re-emphasize that we are speaking of the entropy rate
of compiled programs; source representations are highly compressible
to support readability.

3 A Platonic object is an abstract entity thought to dwell in
some realm outside spacetime. Our stance with respect to
software libraries echoes mathematical Platonism, that mathematical
objects about which we reason exist in some idealized
form outside the physical universe (see, e.g., 0).

We unfortunately do not have a good sense of how to go
from the problem domain's distributions $p_{s_0}$ on uncompressed
programs to rates of components in compressed programs; this
is tied up with the ergodic process issues mentioned in Section 2.2.
We dodge the issue by simply assuming that the
mean rates λ(n) exist. This is not a demanding assumption;
many sensible random process models would imply Assumption 1,
for example modelling component uses as a renewal
process (see, e.g., [26, §3]). 4

2.6 Ordering of library components 

For convenience we shall suppose the library components
are arranged in decreasing order of expected reuse rate in the
problem domain: that is,

λ(n) ≥ λ(n + 1)

There are two reasons for this. The first is tidiness, so that
when we plot λ(n) vs n we see a monotone function and not
noise. The asymptotic bounds we derive on λ(n) do not rely
on this ordering. The second reason is that to derive bounds
on how well we might compress programs we need to assign
shorter identifiers to more frequently used components. This
is easiest to reason about if the Platonic library is sorted by
use frequency. 5
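An empirical counterpart of the reuse rate and of this ordering can be sketched as follows. This is a minimal illustration, not the paper's measurement procedure: the rate of a component is estimated as its total number of references divided by the total number of bits of compiled code, and components are then ranked by decreasing estimated rate. The program sizes and reference lists are invented sample data, and `estimate_reuse_rates` is a hypothetical helper.

```python
from collections import Counter

def estimate_reuse_rates(programs):
    """Estimate reuse rates (references per bit) and rank components by rate.

    `programs` is a list of (size_in_bits, referenced_component_names) pairs.
    Returns a list of (name, rate) sorted by decreasing rate.
    """
    total_bits = sum(size for size, _ in programs)
    counts = Counter()
    for _, references in programs:
        counts.update(references)
    rates = {name: count / total_bits for name, count in counts.items()}
    # Rank n = 1 is the most frequently referenced component; ties broken by name.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))

# Invented sample data: two compiled programs and the subroutines they reference.
sample = [
    (1000, ["memcpy", "printf", "memcpy"]),
    (3000, ["printf", "malloc", "memcpy", "memcpy"]),
]
ranked = estimate_reuse_rates(sample)
for rank, (name, rate) in enumerate(ranked, start=1):
    print(rank, name, rate)
```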

3. KOLMOGOROV COMPLEXITY 

Kolmogorov complexity, also known as Algorithmic Information
Theory, was founded in the 1960s by R. Solomonoff,
G. Chaitin, and A.N. Kolmogorov. We shall only make use of
some basic facts; for a more thorough introduction the survey
article [20] or the book [19] are recommended. The central
idea is simple: measure the 'complexity' of an object by the
length of the smallest program that generates it. This generalizes
to the study of description systems, that is, systems by
which we define or describe objects, of which programming
languages, logics, and descriptive set theory are prominent
examples. The source code of a program, for example, describes
a program behaviour; a set of axioms describes a class
of mathematical structures. In the general case we have some
objects we wish to describe, and a description system φ that
maps from a description w (for us, a program) to objects. The
usual situation is to describe an object by exhibiting a program
that generates it; in this case we may also provide some
inputs to the program, which we shall call parameters. The
Kolmogorov complexity of an object x in the description system
φ, relative to a parameter y, is defined by:

$$C_\phi(x \mid y) = \min_{w} \{ \|w\| : \phi_w(y) = x \} \qquad (2)$$

In the case where the description system is a programming
language, we may read Eqn. (2) as finding the shortest program
that, given input parameter y, outputs x. The parameter
y does not contribute to the measured description length
$C_\phi(x \mid y)$. Without a parameter we have the simpler case
$C_\phi(x) = C_\phi(x \mid \epsilon)$ where ε is the empty string.

4 For readers familiar with coding theory we forestall confusion
by mentioning that the rates λ(n) are not the same as the
usual notion of probabilities over countable alphabets. The
rates λ(n) are drawn from compressed programs and so already
incorporate code lengths.

5 Jeremiah Willcock made the useful suggestion that we may
regard the Platonic library as containing already every possible
component, and the only question is the order in which
they are placed.

For example, we might choose the programming language
Java as our description system; then for some string x, its Kolmogorov
complexity $C_{\mathrm{Java}}(x)$ is the length of the shortest program
that outputs x. To determine whether use of a library L
offers a reduction in program size, we can consider the combination
of Java and the library L as a description system itself
which we might call Java + L, and compare $C_{\mathrm{Java}+L}(x)$ to
$C_{\mathrm{Java}}(x)$.

A very useful insight is that the choice of language doesn't 
much matter. 

Fact 3.1 (Invariance [19, §2.1]). There exists a universal machine
U such that if φ is some effective description system (e.g.,
a programming language) then there is a constant c such that

$C_U(x) \le C_\phi(x) + c$ for any x.

That is, the universal machine U is optimal up to an additive
constant. For this reason the subscript U can be dropped and one
can write C(x) for the Kolmogorov complexity of x, knowing
it is only defined up to some additive constant. 6

Some strings have very short descriptions: a string of a 
trillion zeros may be produced by a short program. Others 
require descriptions as long as the strings themselves, for in- 
stance a million digit binary string obtained from a physical 
random bit generation device. 7 A recurrent theme in Kol- 
mogorov complexity is that there are never enough descrip- 
tions to go around so as to give short descriptions to most 
objects. In the case where both the objects and their descrip- 
tions are binary strings, we have the following well-known 
result that the probability we can save more than a constant 
number of bits in compressing randomly selected strings is 
zero. 
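Kolmogorov complexity itself is not computable, but any concrete compressor gives an upper bound on description length, up to the constant size of its decompressor. The sketch below uses Python's `zlib` module as such a proxy; this is an assumption of the example, not a method used in the paper. It contrasts a highly regular string with pseudo-random bytes.

```python
import random
import zlib

def compressed_bits(data):
    """Length in bits of the zlib encoding of `data` (an upper-bound proxy only)."""
    return 8 * len(zlib.compress(data, 9))

size = 100_000
zeros = bytes(size)  # 100,000 zero bytes: a very regular string

# Pseudo-random bytes from a fixed seed, used here as a stand-in for random data.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(size))

print(8 * size, compressed_bits(zeros), compressed_bits(noise))
```

One caveat is worth stating: the pseudo-random bytes are generated by a short program from a fixed seed, so their true Kolmogorov complexity is small even though zlib cannot shrink them. A compressor can only certify that a string is compressible, never that it is not.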

Fact 3.2 (Incompressibility [19, §2.2]). Suppose g : N → N
is an integer function with g(n) > 0 and g ∈ ω(1), that
is, $\lim_{n \to \infty} g(n) = \infty$. Let x be a string chosen uniformly at
random. Then almost surely:

$$C_\phi(x) > \|x\| - g(\|x\|) \qquad (3)$$
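The counting argument behind Fact 3.2 can be checked numerically. The sketch below is illustrative: it evaluates the ratio of the number of binary descriptions of length at most n − g to the number of binary strings of length at most n, and compares it with 2^{-g}. The function name and the sample values of n and g are assumptions of this example.

```python
from fractions import Fraction

def compressible_fraction_bound(n, g):
    """Upper bound on the fraction of strings of length <= n that can be
    described in at most n - g bits (requires 0 <= g <= n)."""
    if not 0 <= g <= n:
        raise ValueError("need 0 <= g <= n")
    descriptions = 2 ** (n - g + 1) - 1  # binary strings of length 0..n-g
    strings = 2 ** (n + 1) - 1           # binary strings of length 0..n
    return Fraction(descriptions, strings)

for g in (1, 5, 10, 20):
    bound = compressible_fraction_bound(40, g)
    print(g, float(bound), 2.0 ** -g)
```

As g grows the bound falls off like 2^{-g}, which is why any unbounded g(n) leaves a vanishing fraction of compressible strings.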

Fact 3.2 implies, for example, that one cannot devise a
coding system that compresses strings by even log log n or
α^{-1}(n, n) (inverse Ackermann) bits with nonzero probability.
The proof of Fact 3.2 uses counting arguments only, with no
appeal to computability of the description system. 8 Therefore
the inequality (3) applies to any description system φ, even
description systems that are not computable. For example,
Fact 3.2 even applies if we permit ourselves to use an infinite,
not computably enumerable library as we described in Section 2.4.
However, it does not apply in the case where there
is a nonuniform distribution, as in problem domains where
H < 1.

6 There is an easy way to see why this is true: if φ is a programming
language, then we can write a φ-interpreter for
the universal machine U. We can then take any program for
φ, prepend the interpreter, and it becomes a U-program. The
constant mentioned reflects the size of such an interpreter.

7 Unless you are rather lucky.

In the remainder of this paper we shall assume compiled
programs are incompressible in the sense of Fact 3.2.

Proposition 3.1. Compiled C programs on existing major ar- 
chitectures are almost surely Kolmogorov incompressible. 

Note that "almost surely Kolmogorov incompressible" does 
not imply anything about the compressibility of typical com- 
piled programs for a problem domain. Rather, it means that if 
one chooses a valid compiled program uniformly at random, 
with probability 1 it cannot be replaced by a shorter program 
with the same behaviour. In subsequent sections we investi- 
gate problem domains where there is a nonuniform distribu- 
tion on programs, i.e., H < 1, where the situation is rosier. 

We sketch a proof of Proposition 3.1 by showing that the number
of distinct behaviours described by compiled programs
of s bits grows as ~ 2^s on current machines, which implies
compiled programs are almost surely (Kolmogorov) incompressible.
The C language has the useful ability to incorporate
chunks of binary data in a program. For example, the
binary string z = 0110100111011010 may be encoded by the
C declaration

unsigned char z[2] = {0x69, 0xda};

Moreover, such arrays are laid out as contiguous binary data
in the compiled program, so that a binary string of length m
bytes requires exactly m bytes in the compiled program. We
can package such an array with a short program of constant
size that reads the binary string from memory and outputs it
to the console. Every binary string of m bytes may be encoded
by such a compiled program of size at most c + m bytes, where
c is a constant representing the overhead of a read-print loop.
Every such program yields a unique behaviour, so the number
of distinct behaviours of compiled programs of s bits is ~ 2^s.
We can then adapt the argument used to prove Fact 3.2, replacing
strings by compiled programs, which shows compiled
C programs are almost surely incompressible.
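The encoding step of this proof sketch is mechanical. As a small illustration, the hypothetical helper below turns a binary string whose length is a multiple of 8 into the corresponding C array declaration, one byte per group of eight bits. It reproduces the example declaration for z given above.

```python
def c_declaration(bits, name="z"):
    """Render a binary string (length a positive multiple of 8) as a C byte array."""
    if not bits or len(bits) % 8 != 0 or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s, length a multiple of 8")
    # Each group of 8 bits becomes one byte, most significant bit first.
    byte_values = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    body = ", ".join(f"0x{value:02x}" for value in byte_values)
    return f"unsigned char {name}[{len(byte_values)}] = {{{body}}};"

print(c_declaration("0110100111011010"))
```

Distinct bit strings give distinct declarations, which is the injectivity the counting argument relies on.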

Note that uncompiled programs are highly compressible.
For example, C language source code may not contain certain
bytes (e.g., control characters) such as the null character
0x00. This means they can be compressed by a factor of (at
least) 1/256 ≈ 0.39%. Restricting our attention to compiled programs
is crucial. 9

8 There are $2^{n-g(n)+1} - 1$ descriptions of length at most $n - g(n)$,
and $2^{n+1} - 1$ strings of length at most n. Therefore
the fraction of strings compressible by g(n) bits is at most
$\frac{2^{n-g(n)+1} - 1}{2^{n+1} - 1}$, which behaves in the limit as $2^{-g(n)}$. If $g \in \omega(1)$
this value vanishes as $n \to \infty$, so $C_\phi(x) > \|x\| - g(\|x\|)$
almost surely.

9 An alternative would be to deal with indices of programs 
in the usual sense of computability theory, where we equate 
a program with its position in some effective enumeration 
of valid source-language programs. However, working with 
compiled programs has the additional benefit of brushing 
aside issues such as identifier lengths in source code, which 
tend to be unnecessarily long to aid readability. 



4. A BOUND ON REUSE RATES 

In this section we derive a bound on the reuse rate λ(n)
at which the n-th library component is reused in 'compressed'
programs written with use of a library.

4.1 Coding of references 

We need some rudimentary accounting of what we gain and 
lose by use of the library: we save some by using a library 
component, at the cost of having to refer to it. Let us first 
consider the cost of referring to components. 

We presume that unique identifiers are assigned to library 
components; we call these codewords. Let $c(n)$ be the binary
codeword for the $n$th library component, and $\|c(n)\|$
its length. Optimal strategies such as Shannon-Fano or Huffman
codes assign shortest codewords to the most frequently
needed components. Since our library is sorted in order of
use frequency (Section 2.6), we may presume that
$\|c(n)\| \le \|c(n+1)\|$, i.e., codeword lengths are nondecreasing
as we go down the list of components.

In what follows we want to make asymptotic arguments,
and fixing an identifier size (e.g., 64 bits) would lead to wildly
wrong conclusions. 10 Instead we require that the identifier
size grows with the number of components, albeit slowly.
That $\|c(n)\| \ge \log_2 n$ follows from the pigeonhole principle.
Identifiers of length only $\log_2 n$ lead to difficulties,
because they are not uniquely decodable: presented with a
string of such identifiers, we have no way to tell
where one identifier stops and the next starts. (This does
not arise in current architectures because of fixed word size,
but as we said, care is needed in asymptotic arguments.) A
more accurate requirement is the following, which draws on
Kraft's inequality that uniquely decodable codes must satisfy
$\sum_{n=1}^{\infty} 2^{-\|c(n)\|} \le 1$.
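
To make the role of Kraft's inequality concrete, the sketch below compares the Kraft sum for plain binary numerals (about $\log_2 n$ bits, not uniquely decodable when concatenated) with the sum for a self-delimiting code. The Elias gamma code is an assumption chosen purely for illustration; it is not the coding used in the paper, and all function names are hypothetical.

```python
from fractions import Fraction


def kraft_sum(length_of, count: int) -> Fraction:
    """Exact Kraft sum: sum over n = 1..count of 2^(-length_of(n))."""
    return sum((Fraction(1, 2 ** length_of(n)) for n in range(1, count + 1)),
               Fraction(0))


def plain_binary_length(n: int) -> int:
    """Bits in the ordinary binary numeral for n (about log2 n)."""
    return n.bit_length()


def elias_gamma_length(n: int) -> int:
    """Bits in the Elias gamma code for n >= 1: a unary length prefix
    followed by the binary numeral, 2*floor(log2 n) + 1 bits in total."""
    return 2 * (n.bit_length() - 1) + 1


if __name__ == "__main__":
    # Plain binary lengths: the sum grows without bound (Kraft is violated).
    # Elias gamma lengths: the sum stays below 1 (Kraft is satisfied).
    for count in (15, 255, 4095):
        print(count,
              float(kraft_sum(plain_binary_length, count)),
              float(kraft_sum(elias_gamma_length, count)))
```

Each block of numerals with the same bit length contributes exactly 1/2 to the plain-binary sum, which is why identifiers must be somewhat longer than $\log_2 n$ to remain decodable.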

Proposition 4.1. For identifiers to be uniquely decodable,

$$\|c(n)\| \ge \log^{+} n$$

where $\log^{+} n = \log n + \log\log n + \log\log\log n + \cdots$ and the
sum is taken only over the positive real terms.

We omit the proof; see e.g., [25, §2.2.4] or §1.11.2
(in particular problem 1.11.13).
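
The function $\log^{+} n$ is easy to evaluate directly. The sketch below assumes base-2 logarithms (the proposition does not fix a base), and the name `log_plus` is hypothetical.

```python
import math


def log_plus(n: float) -> float:
    """log+ n = log n + log log n + log log log n + ..., base 2,
    summing only the strictly positive terms."""
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


if __name__ == "__main__":
    # Compare the identifier-length bound with the pigeonhole bound log2 n.
    for n in (2, 16, 65536, 10 ** 6):
        print(n, math.log2(n), log_plus(n))
```

For powers of two that are themselves iterated powers of two, the terms are exact integers (for example 16 gives 4 + 2 + 1), which makes the overhead above $\log_2 n$ easy to see.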

4.2 Derivation of reuse rate bound 

We now derive an asymptotic upper bound on the rates
$\lambda(n)$ at which library components may be reused. We do this
under the assumption that each time a library component is
used in a program, the same identifier is used to refer to it,
i.e., there is no recoding of identifiers. 11 Our argument follows

10 If we fix memory addresses to be representable in 64 bits,
then the time to search an acyclic linked list is $O(1)$ since
there are at most $2^{64}$ steps the algorithm must go through.

11 There are two reasons for this assumption. (1) On the architectures
from which we collect empirical data, there is
no recoding of identifiers in programs. (2) The reason one
might want to recode identifiers is to save space by introducing
shorter aliases for components for use within the program,
after the initial reference. However, this only saves space if a
component is more likely to be used again given it is used
once. While this is intuitively true of real programs, it is false
under a maximum entropy assumption (Section 2.3): in an
encoding that maximizes entropy, the sequence of identifiers
in a program behaves statistically as if independent and identically
distributed.



standard lines [24] but adapted to coding of library references
under the model laid out in Section 2.

Theorem 4.1. Without recoding of identifiers, the asymptotic
reuse rates $\lambda(n)$ must satisfy $\lambda(n) \prec (n \log n \log^{+} n)^{-1}$.

Proof. We count the size of the references to library components
within compressed programs (i.e., those written with
use of a library). Consider programs of length at most $s$. As
$s \to \infty$, the expected number of occurrences of the $n$th component
tends to $\lambda(n)s + o(s)$ under the assumptions of the model in Section 2. Referring
to the $n$th component requires at least $\log^{+} n$ bits (Proposition 4.1).
We need only consider components whose identifier
length is less than $s$, since identifiers longer than the program
would not fit. Therefore we consider only up to component
number $2^s$, since $\log^{+} 2^s > s$.

The expected total size of all the references to components
is then at least:

$$\sum_{n=1}^{2^s} \bigl(\lambda(n)s + o(s)\bigr) \log^{+} n$$

The references to components are contained within the program,
and therefore their total size cannot exceed $s$, the
size of the program. Therefore we have an inequality: 12

$$\sum_{n=1}^{2^s} \bigl(\lambda(n)s + o(s)\bigr) \log^{+} n \le s \tag{4}$$



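Dividing through by $s$ and taking the limit as $s \to \infty$,

$$\lim_{s\to\infty} \sum_{n=1}^{2^s} \frac{1}{s}\bigl(\lambda(n)s + o(s)\bigr) \log^{+} n \le 1 \tag{5}$$

Since $\lim_{s\to\infty} \frac{1}{s}\,o(s) = 0$ by definition, 13

$$\sum_{n=1}^{\infty} \lambda(n) \log^{+} n \le 1 \tag{6}$$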



We now consider conditions under which this sum converges.
(Section A.1 summarizes the asymptotic notations used here.)
We argue using Proposition A.1, using a diverging series to
bound the terms of Eqn. (6). The simple argument is to note
that the harmonic series diverges, and therefore the terms of
Eqn. (6) must grow slower than this, so $\lambda(n) \log^{+} n \prec \frac{1}{n}$, or
$\lambda(n) \prec \frac{1}{n \log^{+} n}$. However, this bound is quite loose. A more
slowly diverging series is $\sum_n \frac{1}{n \log n}$. Using this,

$$\lambda(n) \log^{+} n \prec \frac{1}{n \log n}$$

or,

$$\lambda(n) \prec \frac{1}{n \log n \log^{+} n} \tag{7}$$

This completes the proof. □
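
The divergence of $\sum_n \frac{1}{n \log n}$ invoked in the proof can be verified with the integral test. The worked step below is a standard calculus argument added for illustration; natural logarithms are assumed, which changes only constant factors.

```latex
\[
\int_{2}^{N} \frac{dx}{x \ln x}
  = \Bigl[\, \ln\ln x \,\Bigr]_{2}^{N}
  = \ln\ln N - \ln\ln 2
  \;\longrightarrow\; \infty
  \quad (N \to \infty),
\]
\[
\text{whereas}\quad
\int_{2}^{N} \frac{dx}{x\,(\ln x)^{2}}
  = \frac{1}{\ln 2} - \frac{1}{\ln N}
  \;\longrightarrow\; \frac{1}{\ln 2}.
\]
```

The first series therefore diverges, though only as slowly as $\ln\ln N$, while squaring the logarithmic factor is already enough for convergence.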



12 Inequality (4) becomes an equation if we consider programs
to consist solely of a sequence of component references, with
no control flow or other distractions. This is possible by building
components and programs from combinators, which can
be made self-delimiting [19, §3.2]. This provides a theoretically
elegant framework, if not entirely intuitive.



The bound of Theorem 4.1 is not tight. No tightest bound
is possible using this line of argument since there is no slowest
diverging series with which to bound a convergent series,
a classical result due to Niels Abel. However, the
bound is tight to within a factor $n^{\epsilon}$ for any $\epsilon > 0$.

Entropy maximization and Zipf's Law. Theorem 4.1 provides
an upper bound on $\lambda(n)$, but the actual rates could well
fall far below it. Why do the curves we see in
practice (e.g., the reuse-rate plot of Section 6) hug the bound of
Theorem 4.1? We believe that we observe $\lambda(n) \sim n^{-1}$ because of
the tendency of libraries to evolve so that programmers can
write as little code as possible, which in turn implies evolution
toward maximum entropy in compiled code (cf. Section 2.3).
The entropy rate of component references is maximized when
$\lambda(n) \sim n^{-1}$ (see, e.g., [13]).



13 Recall that $f \in o(g)$ means $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$.



5. REUSE POTENTIAL 

In the following sections we consider the possibilities of
code reuse in two cases: (1) when $H = 1$ and we have a
uniform distribution on programs; (2) when $0 < H < 1$ and
we have some degree of compressibility in the problem domain.
The case $H = 0$ is left for future work.

5.1 The uniform case: H = 1 

The uniform case of H = 1, in which every program is 
equally likely to be implemented, reduces the scenario to clas- 
sical Kolmogorov complexity with a uniform distribution on 
programs. It has some surprising properties that suggest H = 
1 to be an unlikely scenario for real problem domains. 

Our first result concerns the number of library components 
we might expect to use in a program. Let N(s) be a random 
variable indicating for a program of uncompressed size s the 
number of components whose use reduces program size. Sur- 
prisingly, as program size increases the expected number of 
components that reduce program size is bounded above by a 
constant. 

Theorem 5.1. If $H = 1$ there exists a constant $n_{\mathrm{crit}}$ independent
of program size $s$ such that $N(s) \le n_{\mathrm{crit}}$ almost surely.

Proof. Suppose each component used saved at least 1 bit. If
$\lim_{s\to\infty} E[N(s)]$ were unbounded, use of the library could
compress random programs by an unbounded amount, contradicting
incompressibility (Fact 3.2). □

This has a simple corollary concerning the potential for 
code reuse. 

Corollary 5.1. When $H = 1$ the expected proportion of a program
that can be reused from libraries tends to zero as program
size increases.

Because of these results, the case H = 1 is somewhat un- 
interesting and does not seem to model real life, where we 
know libraries are useful and let us reduce the size of pro- 
grams. In the next sections we examine the more interesting 
case of $0 < H < 1$, where we can compress programs, even
ones that are (Kolmogorov) incompressible, by use of a library.

5.2 The nonuniform case: 0 < H < 1

More interesting than the uniform case is the situation when
$0 < H < 1$, which implies a nonuniform distribution on programs.
This models problem domains that have some potential
for code reuse, and libraries are of central importance in
reducing program size. Recall from Section 2.2 that we can
expect to compress programs in such domains from uncompressed
size $s$ to at best $Hs$ by use of a library. A standard
result from information theory can be adapted to show this
bound is achievable, at least in a theoretical sense.

Claim 5.1. There exists a library with which uncompressed pro- 
grams of size s can be compressed to expected size ~ Hs. 

The proof of this is not particularly illustrative and we banish
it to a footnote. 14 The gist is to place every possible program
into the library as a "component," but ordering them so
that the most likely programs for the problem domain come
soonest in the library order and thus are assigned the shortest
codewords. This is a wildly impractical construction but
demonstrates the claim. In practice we decompose software
into reusable chunks that we put in libraries; that reusable
chunks exist suggests an ergodic property (see Section 2.2.1).
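
The idea of giving the most likely programs the shortest codewords can be illustrated with Shannon code lengths $\lceil \log_2 (1/p) \rceil$. This is a minimal sketch under assumptions: the four-element distribution is hypothetical, the function names are illustrative, and only codeword lengths (not the codewords themselves) are computed.

```python
import math


def shannon_code_lengths(probs):
    """Codeword lengths ceil(log2(1/p)) for each positive probability p."""
    return [math.ceil(-math.log2(p)) for p in probs]


def entropy_bits(probs):
    """Shannon entropy of the distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs)


def expected_length(probs, lengths):
    """Average codeword length under the given distribution."""
    return sum(p * l for p, l in zip(probs, lengths))


if __name__ == "__main__":
    # Hypothetical distribution over four "programs", most likely first.
    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = shannon_code_lengths(probs)
    h = entropy_bits(probs)
    avg = expected_length(probs, lengths)
    print(lengths, round(h, 3), round(avg, 3))
```

The average length always lies between the entropy and the entropy plus one bit, which is the property the construction in the claim relies on.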

Unlike the situation of $H = 1$ where the number of components
useful for a program was at most a constant, when
$0 < H < 1$ we have a much more pleasing situation: the number
of useful components increases steadily as we increase
program size.

5.2.1 The incompleteness of libraries 

Under reasonable assumptions we prove that no finite library
can be complete: there are always more components
we can add to the library that will allow us to increase reuse and
make programs shorter. To make this work we need to settle
a subtle interplay between the Kolmogorov complexity notion
of compressibility (there is a shorter program doing the same
thing) and the information theoretic notion of compressibility
(low entropy over an ensemble of programs). Because
we defined probability distributions on programs (rather than
behaviours), we run into the possibility that the probability
distribution might weight heavily programs that are Kolmogorov
compressible, i.e., the distribution might prefer programs
$w$ with $\|w\| \gg C(w)$. For example, a problem domain
might have programs that are usually compressible to half
their size not because the probability distribution focuses on
a particular class of problems, but because we artificially defined
$p_{s_0}$ to select only those programs that are twice as large
as they might be (for example, we might pad every likely program
with many nop instructions). To avoid this difficulty we
require the distributions be honest in the following sense.



14 Proof. We first describe an encoding that compresses programs
to achieve an expected size $Hs$, and then explain how
to construct the library. Recall the Shannon-Fano code [19,
§1.11] allows a finite distribution with entropy $H$ to be coded
so that the expected codeword length is $< H + 1$. We adapt
this as follows. For each $s_0 \in \mathbb{N}$, we produce a Shannon-Fano
codebook for all programs of length $\le s_0$ achieving average
codeword size $< H(p_{s_0}) + 1$ for the distribution $p_{s_0}$ (Section 2.2).
By definition $H(p_{s_0}) \le H s_0$ almost surely, so this
achieves a compression ratio of $H$ almost surely for each $s_0$ as
$s_0 \to \infty$. To combine all the codebooks into one, we preface
a compressed program with an encoding of its uncompressed
length, which we use to select the appropriate codebook. This
can be done by adding to each codeword $c + 2 \log s$ bits for
some constant $c$, which is negligible with respect to $Hs$ when
$H > 0$. Therefore this encoding achieves expected program
size $\sim Hs$. We use the codebook as the library: each component
identifier is a Shannon-Fano code, each component
is a program. Note that the reuse rates vanish for this construction,
i.e., $\lambda(n) \to 0$ as $s_0 \to \infty$, and so the bound of
Theorem 4.1 is trivially satisfied. □



Definition 2 (Honesty). We say the distributions $p_{s_0}$ for a
problem domain are honest if the programs are Kolmogorov
incompressible. Specifically,

$$E\left[\frac{C(w)}{\|w\|}\right] \to 1 \quad \text{as } s_0 \to \infty \tag{8}$$

where the expectation is taken over the distributions $p_{s_0}$. This
requires that the probability distribution does not artificially
prefer verbose programs with $\|w\| \gg C(w)$.

If the distribution for a problem domain is honest and has 
H < 1, the programs are expected to be information-theoretically 
compressible by use of a library, but not Kolmogorov compress- 
ible. In other words, our ability to compress programs is due 
to a "focus" on a class of problems of interest to the domain, 
not just an artificial selection of overly verbose programs.
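
The padding scenario that honesty rules out can be illustrated with a general-purpose compressor. This is only a rough sketch: $C(w)$ is uncomputable, so zlib is used here as a crude, computable upper-bound proxy (an assumption of this example, not a technique from the paper), pseudo-random bytes stand in for an incompressible program, and zero bytes stand in for nop padding.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size over original size, using zlib as a crude
    upper-bound proxy for C(w) / ||w||."""
    return len(zlib.compress(data, 9)) / len(data)


def make_examples(size: int = 4096, seed: int = 0):
    """Return (dense, padded): pseudo-random bytes standing in for an
    incompressible program, and the same bytes followed by `size` zero
    bytes standing in for nop padding."""
    rng = random.Random(seed)
    dense = bytes(rng.getrandbits(8) for _ in range(size))
    padded = dense + bytes(size)
    return dense, padded


if __name__ == "__main__":
    dense, padded = make_examples()
    # The dense program barely compresses; the padded one shrinks to
    # roughly half its size, mimicking a "dishonest" distribution.
    print(compression_ratio(dense), compression_ratio(padded))
```

A distribution concentrated on programs like `padded` would show $H < 1$ for reasons unrelated to any commonality in the problem domain, which is exactly what the honesty condition excludes.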

Inspired by Euclid's proof that there are infinitely many 
primes, with the honesty assumption we can prove there are 
infinitely many reusable software components that make pro- 
grams shorter. 

First we need two smaller pieces of the puzzle. 

Lemma 5.1. If $H > 0$ then for any finite $k$, $\Pr(\|w\| \le k) \to 0$
as $s_0 \to \infty$.

Proof. We know from the definition of $H$ that $H(p_{s_0}) = H s_0$
infinitely many times as $s_0 \to \infty$ (Section 2.2). Consider
how probability must be distributed among programs of different
lengths to account for this much entropy. We try to
account for as much entropy as we can by short programs,
setting a uniform distribution $p(w) = 2^{-H s_0}$ on the first $2^{H s_0}$
programs; this is the fewest number of programs that would
produce this much entropy. To programs of length $\le k$ we
can account for

$$\sum_{i=0}^{k} 2^{i} \cdot 2^{-H s_0} \log \frac{1}{2^{-H s_0}} \;\le\; 2^{k+1-H s_0}\, H s_0$$

bits of entropy. But as $s_0 \to \infty$, $2^{k+1-H s_0} H s_0 \to 0$, so we can
account for none of the entropy by programs of length $\le k$.
Therefore $\Pr(\|w\| \le k) \to 0$ as $s_0 \to \infty$. □
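
How quickly the entropy carried by short programs vanishes can be seen by evaluating the bound $2^{k+1-Hs_0} H s_0$ directly. The sketch below is illustrative only; the function name and the sample values of $k$, $H$ and $s_0$ are hypothetical.

```python
def short_program_entropy_bound(k: int, entropy_rate: float, s0: int) -> float:
    """Bound 2^(k+1 - H*s0) * H*s0 on the entropy (in bits) that programs
    of length <= k can carry when each has probability 2^(-H*s0)."""
    hs = entropy_rate * s0
    return 2.0 ** (k + 1 - hs) * hs


if __name__ == "__main__":
    # With k and H fixed, the bound collapses as the size limit s0 grows.
    for s0 in (100, 200, 400):
        print(s0, short_program_entropy_bound(10, 0.5, s0))
```

The exponential factor dominates the linear factor $H s_0$, so any fixed length cutoff $k$ eventually accounts for a negligible share of the entropy.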

Lemma 5.2. If $H > 0$ then $E\left[\frac{1}{\|w\|}\right] \to 0$ as $s_0 \to \infty$.

Proof. Suppose $E\left[\frac{1}{\|w\|}\right] \to c$ for some $c > 0$. Then there
would be a finite probability that $\|w\| \le c^{-1}$ as $s_0 \to \infty$,
contradicting Lemma 5.1. □

Now we are ready for the main theorem, which proves no
finite library can be "complete" in the sense of achieving a
compression ratio of $H$ when $0 < H < 1$.

Theorem 5.2 (Library Incompleteness). If a problem domain
has $0 < H < 1$ and honest distributions (Defn. 2), no
finite library can achieve an asymptotic compression ratio of $H$.

Proof. Suppose a finite library of components achieves a compression
factor $1 - \epsilon$, with optimal compression when $1 - \epsilon = H$.
Call the programming language $\phi$ and the library $L$. We
can write an interpreter for $\phi$ that incorporates the library $L$;
since the library is finite this is a finite program. We call the
resulting machine model $\phi + L$. Consider Kolmogorov complexity
for this machine, writing $C_{\phi+L}(w)$ for the size of the
smallest $\phi$-program using $L$ that has the same behaviour as
$w$. Saying the machine $\phi + L$ achieves the compression factor
$1 - \epsilon$ implies

$$E\left[\frac{C_{\phi+L}(w)}{\|w\|}\right] \to 1 - \epsilon \tag{9}$$

From the invariance theorem of Kolmogorov complexity (Fact 3.1)
we have that there exists a constant $c$ such that

$$C(w) \le C_{\phi+L}(w) + c \tag{10}$$

for every program $w$. Dividing through by $\|w\|$ and taking
expectation,

$$E\left[\frac{C(w)}{\|w\|}\right] \le E\left[\frac{C_{\phi+L}(w)}{\|w\|}\right] + E\left[\frac{c}{\|w\|}\right] \tag{11}$$

From honesty $E\left[\frac{C(w)}{\|w\|}\right] \to 1$, and from Lemma 5.2 we have
$E\left[\frac{c}{\|w\|}\right] \to 0$. Therefore (11) is, in the limit as $s_0 \to \infty$:

$$1 \le (1 - \epsilon) + 0$$

For this inequality to hold, $\epsilon \to 0$ for any finite library. Therefore
no finite library can achieve an asymptotic compression
ratio $< 1$ when the distributions are honest. □

Claim 5.1 showed that an infinite library can achieve expected
size $\sim Hs$; Theorem 5.2 shows that no finite library
can. Therefore only infinite libraries can compress programs
of size $s$ to expected size $Hs$. However, this is an asymptotic
argument; if we restrict ourselves to programs of size $\le s_0$
for some fixed $s_0$, finite libraries can approach a compression
ratio of $H$ by including more and more components.
Doug Gregor suggested calling Theorem 5.2 the Full Employment
Theorem for Library Writers, after Andrew Appel's boon
to compiler writers. Theorem 5.2 has a straightforward implication:
no finite library can be complete; there are always
more useful components to add. In practice we have a tradeoff
between the utility of larger libraries and the economic
cost of producing them; this suggests the importance of designing
libraries for extensibility.

A minor change to the above proof yields a similar but 
slightly stronger result. 

Corollary 5.2. If a problem domain has $0 < H < 1$ and honest
distributions, no computably enumerable library can achieve a
compression ratio of $H$.

Proof. Repeat the proof of Theorem 5.2, replacing "finite library"
with "c.e. library." In particular the choice of a c.e.
library guarantees that the interpreter for $\phi + L$ is a finite
program: whenever a library subroutine is required, it can be
generated from the program enumerating the library. □

We may casually equate "not computably enumerable" with
"requires human creativity." Corollary 5.2 indicates that the
process of discovering new and useful library components is
not a process that can be fully automated.

5.2.2 Size of library components

We now consider how big library components might be. If
we want to achieve the strong reuse vision of "programming
by wiring together large components," this suggests that components
ought to be quite large compared to the wiring. The
following theorem sheds light on the conditions when this is
possible.

Let $S(n)$ denote the expected amount of code (in bits) saved
per use of the $n$th component. We consider the case when
$\lambda(n) \sim \frac{1}{n \|c(n)\| f(n)}$, where $\|c(n)\|$ is the codeword (identifier)
length, and $f(n)$ is a function $f \in o(n^{\epsilon})$ for $\epsilon > 0$ that
ensures convergence (cf. Section 4.2). This coincides with a
Zipf-style law as observed in practice (see Section 6).

Theorem 5.3. If a library achieves a compression factor of $H > 0$
in an honest problem domain, then $S(n) \sim \frac{1-H}{H} \cdot o(n^{\epsilon})$ for
any $\epsilon > 0$.

Proof. Summing over all components, the total code saved is:

$$\sum_{n=1}^{\infty} \underbrace{\bigl(\lambda(n)Hs + o(Hs)\bigr)}_{\text{expected \# uses}} \cdot \underbrace{S(n)}_{\text{savings per use}} = \underbrace{(1-H)s}_{\text{total savings}} \tag{12}$$

Dividing through by $Hs$ and taking the limit as $s \to \infty$, and
substituting $\lambda(n) \sim \frac{1}{n \|c(n)\| f(n)}$,

$$\sum_{n=1}^{\infty} \frac{1}{n \|c(n)\| f(n)}\, S(n) = \frac{1-H}{H}$$

Now if $S(n) \sim n^{\alpha}$ for some constant $\alpha > 0$ then the sum
would diverge. Therefore $S(n)$ is not polynomial in $n$; in fact
for the sum to converge we must have $S(n) \prec f(n)$, which
means $S(n)$ behaves asymptotically as

$$S(n) \sim \frac{1-H}{H} \cdot o(n^{\epsilon})$$

where $o(n^{\epsilon})$ denotes some subpolynomial function. □

Note that if the components in the library are unique, then 
S(n) > log n by pigeonhole. 

Strong reuse? The interpretation of Theorem 5.3 is fairly intuitive.
Roughly it says the savings we can expect per component
are linear in the size of the component identifier; that
is, we should expect savings for the $n$th component to
grow roughly as $\log^{+} n$. This is consistent with findings in the
reuse literature that small components are much more likely
to be reused. The important factor here is the multiplier $\frac{1-H}{H}$.
As $H \to 0$, this multiplier becomes arbitrarily large. This suggests
that "strong reuse" corresponds to the region
$H \to 0$. For example, if programs in a problem domain are
thought to be solvable by wiring together components that
are (say) 1000 times bigger than the wiring itself, this suggests
$\frac{1-H}{H} \approx 10^3$ or $H \approx 0.001$. The key result is that whether
one is able to achieve strong reuse depends critically on the
parameter $H$, which measures how much diversity there is
in the problem domain.
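
The arithmetic behind the 1000-to-1 example is a one-line inversion of the multiplier $\frac{1-H}{H}$. The sketch below is for illustration; both function names are hypothetical.

```python
def savings_multiplier(entropy_rate: float) -> float:
    """(1 - H) / H: code drawn from the library per unit of code written,
    defined for 0 < H <= 1."""
    if not 0.0 < entropy_rate <= 1.0:
        raise ValueError("entropy rate H must lie in (0, 1]")
    return (1.0 - entropy_rate) / entropy_rate


def entropy_rate_for_multiplier(multiplier: float) -> float:
    """Invert the multiplier: solving (1 - H) / H = m gives H = 1 / (1 + m)."""
    if multiplier < 0:
        raise ValueError("multiplier must be non-negative")
    return 1.0 / (1.0 + multiplier)


if __name__ == "__main__":
    # Components 1000 times larger than the wiring need H close to 0.001.
    print(entropy_rate_for_multiplier(1000))
    print(savings_multiplier(0.5), savings_multiplier(0.9))
```

A multiplier of 1 (as much reused code as new code) already requires $H = 0.5$, which gives a sense of how demanding the strong reuse regime is.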

6. EXPERIMENTAL DATA COLLECTION 

Preliminary empirical data was collected from three large 
Unix installations. The problem domain is not particularly 
well-defined, but is rather "the mishmash of things one wants 
to do on a typical research Unix machine." On the SunOS 
and Mac OS X machines we located every shared object and
used the Unix commands nm or objdump to obtain a listing
of the relocatable symbols (i.e., references to subroutines in
shared libraries). For the Linux machine, a more sophisticated
approach was used that involved disassembling every
executable object and decoding the PLT and GOT tables for
shared library calls. For this reason the Linux data is much
more fine-grained and reliable; for example, our data set for
Linux includes the frequencies of all the x86 machine instructions,
in addition to almost a half-million subroutines.

Operating System    # Objects    # Components
Linux (SuSE)        12136        455716
SunOS               23774        110306
Mac OS X            2334         37677

Figure 3: Plot of $\frac{1-H}{H}$ versus $H$, indicating how much code is saved, proportionately, per component use. When $H \to 1$
there is almost no reuse; $H \to 0$ coincides with the "strong reuse" ideal of wiring together large components. In between
is weak reuse, with moderate amounts of code drawn from libraries.

We counted the number of references to each component and
sorted these by frequency; this data is plotted in the accompanying figure.
The observed counts match nicely the asymptotic prediction
made in Section 4.2 (the family of curves $cn^{-1}$ is shown as
dotted lines). To account for machine instructions, which are
not included in the tally for the Mac OS X and SunOS machines
but constitute by far the most frequently occurring software
components, we started numbering the components for
these machines at $n = 50$. Without this adjustment the rates
have a characteristic "flat top" and then rapidly converge to
$n^{-1}$ lines; this is an artifact of the log-log scale.
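
The rank-frequency analysis described above can be sketched as follows. This is not the authors' tooling: the function names are hypothetical, the input is assumed to be a plain list of per-component reference counts (however obtained), and the slope is estimated by ordinary least squares on the log-log data.

```python
import math


def rank_frequency(counts):
    """Sort reference counts in decreasing order and pair each with its
    rank n = 1, 2, 3, ... (zero counts are dropped)."""
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(points):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(c) for _, c in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic counts proportional to 1/n, standing in for real symbol tallies.
    synthetic = [10_000 / n for n in range(1, 501)]
    print(round(loglog_slope(rank_frequency(synthetic)), 3))
```

A fitted slope near -1 on real tallies would correspond to the $cn^{-1}$ family of curves; the heavy tail of rarely used components can bias such a fit, so the result is best read as a rough diagnostic.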

The pronounced "steps" in the data for large $n$ occur because
there are many rarely-used subroutines with only a few
references; this is typical of such plots (see, e.g., [54]).

Another prediction that may follow from our model is that
the number of distinct components used in a program should
approach a normal distribution: under maximum entropy conditions
the use of components is statistically independent, and
so the central limit theorem applies. This is reminiscent of the
Erdős-Kac theorem [8] that the number of prime factors of
integers converges to a normal distribution. Figure 4 shows
some preliminary results that support this prediction, drawn from
the SuSE Linux data. The number of component references
has been normalized by an estimated variance of $\sigma^2 = cs^2$
where $s$ is the program size. Subfigure (c) shows a suggestively
shaped distribution for the inset box of (a), a region
where there is good "saturation" of the problem domain with
programs.

Our preliminary data demonstrates a Heaps' style law for
vocabulary growth [14, §7.5]: the number of unique components
encountered in examining the first $s$ bytes of the corpus
grows roughly as a power law $s^{\alpha}$ with $\alpha \approx 0.8$. We have not
found a satisfactory theoretical explanation.
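
A vocabulary growth curve of this kind can be computed in a single pass over the corpus. The sketch below is a simplification under stated assumptions: it indexes the curve by the number of references seen rather than by bytes of corpus, and the function name and sample symbols are hypothetical.

```python
def vocabulary_growth(references):
    """For each prefix of the reference stream, the number of distinct
    components seen so far."""
    seen = set()
    growth = []
    for ref in references:
        seen.add(ref)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    # Hypothetical stream of library references in corpus order.
    stream = ["printf", "malloc", "printf", "free", "malloc", "strlen"]
    print(vocabulary_growth(stream))
```

Plotting the resulting counts against position on log-log axes and estimating the slope would give an empirical exponent to compare with the reported value of roughly 0.8.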

7. CONCLUSION 

We have developed a theoretical model of reuse libraries 
that provides good agreement, we feel, with our intuitions, 
the literature, and the preliminary experimental data we have 
collected on reuse on Unix machines. Much of what we have 
done has served to emphasize the importance of this one 
quantity, H, the entropy rate we associate with a problem 
domain. It determines if we can have strong reuse ($H \to 0$),
or whether we can have weak reuse ($0 < H < 1$), and how
much code we might be able to reuse from libraries: at most
$1 - H$.

We have shown that libraries allow us to "compress the in- 
compressible," reducing the size of programs that are Kolmo- 
gorov-incompressible by taking advantage of the commonal- 
ity exhibited by programs within a problem domain. We have 
also shown that libraries are essentially incomplete, and there 
will always be room for more useful components in any prob- 
lem domain. 

The arguments made here are quite general and might adapt
easily to other description systems, for example, the reuse of
abstractions, lemmas and theorems in mathematical proofs.

Figure 4: Data suggesting a library analogue of the Erdős-Kac theorem. (a) A scatter-plot showing the number of distinct
library subroutines used vs. software size for the Linux RPM data. (b) Histogram for the number of references, normalized
(see text). (c) Histogram only for the inset box of (a), illustrating an Erdős-Kac-style normal distribution for the number
of components used in software. Such plots might provide a useful tool for assessing the extent of reuse vs. ideal
predictions from a model.
APPENDIX

A. BACKGROUND 

A.1 Asymptotics

Here we recall briefly some facts and notations concerning 
asymptotic behaviour of functions and series. For a more detailed
exposition we suggest [10].

Asymptotic notations. For positive functions $f(n)$ and $g(n)$,
we make use of these notations for asymptotic behaviour:

$$f(n) \sim g(n) \iff \lim_{n\to\infty} \frac{f(n)}{g(n)} = 1 \tag{13}$$

$$f(n) \prec g(n) \iff \lim_{n\to\infty} \frac{f(n)}{g(n)} = 0 \tag{14}$$

$$f(n) \preceq g(n) \iff \exists c \in \mathbb{R}.\ \lim_{n\to\infty} \frac{f(n)}{g(n)} = c \tag{15}$$

The "big-O" style of notation $f \in o(g)$ is equivalent to $f(n) \prec g(n)$.
When we write $h(n) \sim g(n) + o(n^2)$ we mean there
exists some function $f \in o(n^2)$ such that $h(n) \sim g(n) + f(n)$.

Series and their convergence. A series $\sum_{n=1}^{\infty} a_n$ is convergent
when $\lim_{N\to\infty} \sum_{i=1}^{N} a_i$ exists in the standard reals; otherwise
it is divergent. The harmonic series $h_n = \frac{1}{n}$ is divergent,
since $\sum_{i=1}^{\infty} \frac{1}{i} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots$ fails to converge.

We shall make use of the following key fact for bounding 
convergent sequences. 

Fact A.1. Let $a_n$, $b_n$ be positive sequences. If $\sum_{n=1}^{\infty} a_n$ converges
and $\sum_{n=1}^{\infty} b_n$ diverges, then $a_n \prec b_n$.

Proposition A.1 is useful to establish a bound on the asymptotic
growth of a sequence: for example, if $\sum_{n=1}^{\infty} a_n$ must
converge, then $a_n \prec \frac{1}{n}$ since the harmonic series diverges.



